Dimensionality Reduction with Multilingual Resource

نویسندگان

  • Yingju Xia
  • Hao Yu
  • Gang Zou
چکیده

Query and document representation is a key problem for information retrieval and filtering. The vector space model (VSM) has been widely used in this domain. But the VSM suffers from high dimensionality. The vectors built from documents always have high dimensionality and contain too much noise. In this paper, we present a novel method that reduces the dimensionality using multilingual resource. We introduce a new metric called TC to measure the term consistency constraints. We deduce a TC matrix from the multilingual corpus and then use this matrix together with the termby-document matrix to do the Latent Semantic Indexing (LSI). By adopting different TC threshold, we can truncate the TC matrix into small size and thus lower the computational cost of LSI. The experimental results show that this dimensionality reduction method improves the retrieval performance significantly.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

2D Dimensionality Reduction Methods without Loss

In this paper, several two-dimensional extensions of principal component analysis (PCA) and linear discriminant analysis (LDA) techniques has been applied in a lossless dimensionality reduction framework, for face recognition application. In this framework, the benefits of dimensionality reduction were used to improve the performance of its predictive model, which was a support vector machine (...

متن کامل

A Monte Carlo-Based Search Strategy for Dimensionality Reduction in Performance Tuning Parameters

Redundant and irrelevant features in high dimensional data increase the complexity in underlying mathematical models. It is necessary to conduct pre-processing steps that search for the most relevant features in order to reduce the dimensionality of the data. This study made use of a meta-heuristic search approach which uses lightweight random simulations to balance between the exploitation of ...

متن کامل

Multilingual Recurrent Neural Networks with Residual Learning for Low-Resource Speech Recognition

The shared-hidden-layer multilingual deep neural network (SHL-MDNN), in which the hidden layers of feed-forward deep neural network (DNN) are shared across multiple languages while the softmax layers are language dependent, has been shown to be effective on acoustic modeling of multilingual low-resource speech recognition. In this paper, we propose that the shared-hidden-layer with Long Short-T...

متن کامل

University of Chicago at CLEF2004: Cross-language Text and Spoken Document Retrieval

The University of Chicago participated in the Cross-Language Evaluation Forum 2004 (CLEF2004) cross-language multilingual, bilingual, and spoken language tracks. Cross-language experiments focused on meeting the challenges of new languages with freely available resources. We found that modest e ectiveness could be achieved with the additional application of pseudo-relevance feedback to overcome...

متن کامل

Diagnosis of Diabetes Using an Intelligent Approach Based on Bi-Level Dimensionality Reduction and Classification Algorithms

Objective: Diabetes is one of the most common metabolic diseases. Earlier diagnosis of diabetes and treatment of hyperglycemia and related metabolic abnormalities is of vital importance. Diagnosis of diabetes via proper interpretation of the diabetes data is an important classification problem. Classification systems help the clinicians to predict the risk factors that cause the diabetes or pre...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008